Client Report - Finding Relationships in Baseball

Using pandas, numpy, sqlite3, lets_plot

Author

[Maia Faith Chambers]

Show the code
import pandas as pd 
import numpy as np
import sqlite3
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)

Client Request:

The client would like SQL queries that can be used to retrieve baseball data for direct use on their website, without requiring Python processing. They also request that the results be visualized with Lets-Plot charts to highlight key patterns and insights in the data.

Summary of overall project work

This project explored Major League Baseball data using SQL (sqlite3) for querying, and Python (pandas, numpy, lets-plot) for data analysis and visualization. The client requested reusable SQL queries for their website along with charts that highlight trends and insights.

* BYU-Idaho Alumni in MLB: Queried salary and career data for two alumni (Mark Lindsma and Garrett Stephenson) and visualized their career earnings over time.

* Batting Averages: Calculated batting averages at the season level (1+ at-bats, 10+ at-bats) and at the career level (100+ at-bats). Results surfaced consistent top performers such as Ty Cobb and Rogers Hornsby, showing the importance of applying thresholds to filter out statistical flukes.

* Team Salary Comparison: Compared the Seattle Mariners and Texas Rangers average salaries across decades, showing steady growth for both but sharper spikes for the Rangers. This analysis raised questions about the link between payroll size and performance.

Additional Analyses:

* Salary Distribution by Position: Used a case statement to categorize positions into high, medium, or low salary groups, showing how financial value is distributed by role.

* Career Longevity: Identified the top 10 longest MLB careers (10+ games played) and calculated the league-wide average career length, providing insights into player durability.

Tools/Packages: SQL (sqlite3), Python (pandas, numpy), Lets-Plot

Show the code
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

BYU-Idaho ALUMNI IN MAJOR LEAGUE BASEBALL

The query below returns the results of two players who attended BYU-Idaho: Mark L. Lindsma and Garrett Stephenson, identified by their player IDs as lindsma01 and stephga01, respectively. Both players had long and varied careers in Major League Baseball. Lindsma played from 2007 to 2014, while Stephenson’s career spanned from 1997 to 2003. Lindsma’s salary peaked later in his career, earning $4,000,000 in 2014 with the Chicago White Sox. In contrast, Stephenson reached his highest salary of $1,025,000 in 2001, during the mid-point of his career with the St. Louis Cardinals. This analysis highlights the financial success of these two BYU-Idaho alumni in the professional baseball world.

Show the code
# Query salary info for BYU-Idaho players using known playerIDs
query_byu_players = """
SELECT 
    playerID, 
    salary,
    yearID,
    teamID
FROM 
    Salaries
WHERE 
    playerID IN ('lindsma01', 'stephga01')
ORDER BY 
    salary DESC;
"""
df_baseball1 = pd.read_sql_query(query_byu_players, con)
df_baseball1["schoolID"] = "idbyuid"
df_baseball1
playerID salary yearID teamID schoolID
0 lindsma01 4000000.0 2014 CHA idbyuid
1 lindsma01 3600000.0 2012 BAL idbyuid
2 lindsma01 2800000.0 2011 COL idbyuid
3 lindsma01 2300000.0 2013 CHA idbyuid
4 lindsma01 1625000.0 2010 HOU idbyuid
5 stephga01 1025000.0 2001 SLN idbyuid
6 stephga01 900000.0 2002 SLN idbyuid
7 stephga01 800000.0 2003 SLN idbyuid
8 stephga01 550000.0 2000 SLN idbyuid
9 lindsma01 410000.0 2009 FLO idbyuid
10 lindsma01 395000.0 2008 FLO idbyuid
11 lindsma01 380000.0 2007 FLO idbyuid
12 stephga01 215000.0 1999 SLN idbyuid
13 stephga01 185000.0 1998 PHI idbyuid
14 stephga01 150000.0 1997 PHI idbyuid
Show the code
#This part is extra, I wanted to see how it would look in a lets plot showing the comparrison of salary over the two players careers.
# Line plot to visualize their salary over time
query_compare_salaries = """
SELECT 
    playerID, 
    yearID, 
    salary
FROM 
    Salaries
WHERE 
    playerID IN ('lindsma01', 'stephga01')
ORDER BY 
    yearID;
"""

df_salaries = pd.read_sql_query(query_compare_salaries, con)

player_colors = {
    'lindsma01': '#27251F',
    'stephga01': '#C41E3A'
}

ggplot(df_salaries, aes(x='yearID', y='salary', color=as_discrete('playerID'))) + \
    geom_line(size=2) + \
    scale_color_manual(values=player_colors) + \
    scale_x_continuous(breaks=list(range(df_salaries['yearID'].min(), df_salaries['yearID'].max()+1))) + \
    ggtitle("Salary Comparison: Mark Lindsma vs Garrett Stephenson") + \
    xlab("Year") + \
    ylab("Salary (USD)") + \
    theme_minimal()

BATTING AVERAGES: SEASON AND CAREER ANALYSIS

Client Request specific
a. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
a. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
a. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.

This three-part question looked at batting averages, which are calculated by dividing hits by at-bats. For the first part, I found the top batting averages for players who had at least one at-bat in a season. Not surprisingly, all five players had a perfect 1.000 average, meaning they got a hit every time they were officially at bat that year. But most of them probably only had one or two chances, so the data isn’t super meaningful yet.

Show the code
query_2a = """
SELECT 
    playerID, 
    yearID, 
    ROUND(CAST(H AS FLOAT)/AB, 3) AS batting_avg
FROM 
    Batting
WHERE 
    AB > 0
ORDER BY 
    batting_avg DESC, playerID ASC
LIMIT 5;
"""

df_2a = pd.read_sql_query(query_2a, con)
df_2a
playerID yearID batting_avg
0 aberal01 1957 1.0
1 abernte02 1960 1.0
2 abramge01 1923 1.0
3 acklefr01 1964 1.0
4 alanirj01 2019 1.0

In the second part, I made the filter a bit stricter by only including players with at least 10 at-bats. This helped bring out more reliable results. The top players now had averages between .571 and .643, with names like Manny Ny (1974) and Carson M. (2013) leading the list. These guys still had great seasons, but the extra filter helps avoid flukes.

Show the code
query_2b = """
SELECT 
    playerID, 
    yearID, 
    ROUND(CAST(H AS FLOAT)/AB, 3) AS batting_avg
FROM 
    Batting
WHERE 
    AB >= 10
ORDER BY 
    batting_avg DESC, playerID ASC
LIMIT 5;
"""

df_2b = pd.read_sql_query(query_2b, con)
df_2b
playerID yearID batting_avg
0 nymanny01 1974 0.643
1 carsoma01 2013 0.636
2 altizda01 1910 0.600
3 johnsde01 1975 0.600
4 silvech01 1948 0.571

For the third part, I calculated batting averages over entire careers by summing up all hits and at-bats for each player and then dividing. I also made sure to only include players with at least 100 at-bats total. The results brought out legends like Ty Cobb (.366), Rogers Hornsby (.358), and Joe Jackson (.356). These are the kind of averages that reflect long-term consistency, not just one standout season.

Show the code
query_2c = """
SELECT 
    playerID, 
    ROUND(SUM(CAST(H AS FLOAT))/SUM(AB), 3) AS career_batting_avg
FROM 
    Batting
GROUP BY 
    playerID
HAVING 
    SUM(AB) >= 100
ORDER BY 
    career_batting_avg DESC, playerID ASC
LIMIT 5;
"""

df_2c = pd.read_sql_query(query_2c, con)
df_2c
playerID career_batting_avg
0 cobbty01 0.366
1 barnero01 0.360
2 hornsro01 0.358
3 jacksjo01 0.356
4 meyerle01 0.356

TEAM SALARY COMPARISON: MARINERS VS. RANGERS

I used the Seattle Mariners and the Texas rangers using average salary by year as my metric. The SQL query groups the player salaries by team and year and then calculates the average salary for each. The results are from the range years of 1985 to 2016, during this time the avaerage salary was over $6 million. The Marineres showed growth but were slightly behind in the end. This comparison gave me a better sense of how each organization has financially valued its players over time. It also raises questions about how payroll size might relate to team performance, although I’m a mariners fan, I realize we lost many games so it would be interesting to see a performance to salary analysis.

Show the code
query_task3_mariners_rangers = """
SELECT 
    teamID, 
    yearID, 
    ROUND(AVG(salary), 2) AS avg_salary
FROM 
    Salaries
WHERE 
    teamID IN ('SEA', 'TEX')
GROUP BY 
    teamID, yearID
ORDER BY 
    yearID;
"""

df_task3_mr = pd.read_sql_query(query_task3_mariners_rangers, con)
df_task3_mr
teamID yearID avg_salary
0 SEA 1985 256277.78
1 TEX 1985 383825.00
2 SEA 1986 229165.73
3 TEX 1986 259350.73
4 SEA 1987 251500.00
... ... ... ...
59 TEX 2014 4677294.13
60 SEA 2015 4888348.00
61 TEX 2015 4791426.30
62 SEA 2016 4845833.54
63 TEX 2016 6070300.79

64 rows × 3 columns

The chart shows that both teams started with lower salaries in the 1980s, but average pay steadily increased over time. While both teams experienced ups and downs, the Texas Rangers had more dramatic spikes and sharper increases compared to the Mariners. The Mariners’ salary growth was more consistent and gradual, whereas the Rangers ended up on top with more noticeable jumps throughout the years.

Show the code
#the plot
from lets_plot.mapping import as_discrete

team_colors = {
    'SEA': '#0C2C56',  # Mariners - Navy
    'TEX': '#C0111F'   # Rangers - Red
}

ggplot(df_task3_mr, aes(x='yearID', y='avg_salary', color=as_discrete('teamID'))) + \
    geom_line(size=1.5) + \
    scale_color_manual(values=team_colors) + \
    scale_x_continuous(breaks=list(range(df_task3_mr['yearID'].min(), df_task3_mr['yearID'].max()+1))) + \
    ggtitle("Average Salary: Seattle Mariners vs Texas Rangers") + \
    xlab("Year") + \
    ylab("Average Salary (USD)") + \
    theme_minimal()

Salary Distribution by Position:

Show the code
# Include and execute your code here
salary_by_position_query = """
WITH PrimaryPosition AS (
    SELECT 
        playerID,
        yearID,
        POS,
        MAX(G) AS games_played
    FROM (
        SELECT 
            playerID, 
            yearID, 
            POS, 
            SUM(G) AS G
        FROM 
            Fielding
        GROUP BY 
            playerID, yearID, POS
    )
    GROUP BY 
        playerID, yearID
    HAVING 
        G = MAX(G)
),

PositionSalary AS (
    SELECT 
        pp.POS AS position,
        s.playerID,
        s.salary
    FROM 
        Salaries s
    JOIN 
        PrimaryPosition pp ON s.playerID = pp.playerID AND s.yearID = pp.yearID
)

SELECT 
    position,
    ROUND(AVG(salary), 2) AS average_salary,
    COUNT(DISTINCT playerID) AS total_players,
    MAX(salary) AS highest_salary,
    CASE 
        WHEN AVG(salary) > 3000000 THEN 'High Salary'
        WHEN AVG(salary) BETWEEN 2000000 AND 3000000 THEN 'Medium Salary'
        ELSE 'Low Salary'
    END AS salary_category
FROM 
    PositionSalary
GROUP BY 
    position
ORDER BY 
    average_salary DESC;
"""

df_salary_by_position = pd.read_sql_query(salary_by_position_query, con)
df_salary_by_position
position average_salary total_players highest_salary salary_category
0 1B 3336210.69 469 28000000.0 High Salary
1 OF 2405474.54 1119 27328046.0 Medium Salary
2 3B 2321342.06 491 33000000.0 Medium Salary
3 SS 2010018.49 368 22600000.0 Medium Salary
4 P 1938130.80 2556 33000000.0 Low Salary
5 2B 1794492.33 485 24000000.0 Low Salary
6 C 1429781.30 404 23000000.0 Low Salary

Career Longevity and Top 10 Longest Careers:

Show the code
career_query = """
WITH CareerSpan AS (
    SELECT 
        a.playerID,
        MIN(a.yearID) AS start_year,
        MAX(a.yearID) AS end_year,
        (MAX(a.yearID) - MIN(a.yearID) + 1) AS career_length,
        SUM(a.G_all) AS total_games
    FROM 
        Appearances a
    GROUP BY 
        a.playerID
    HAVING 
        total_games >= 10
),

TopCareers AS (
    SELECT 
        cs.playerID,
        p.nameFirst AS first_name,
        p.nameLast AS last_name,
        cs.career_length
    FROM 
        CareerSpan cs
    JOIN 
        People p ON cs.playerID = p.playerID
    ORDER BY 
        cs.career_length DESC
    LIMIT 10
)

SELECT 
    * 
FROM 
    TopCareers;
"""

df_top_careers = pd.read_sql_query(career_query, con)
df_top_careers
playerID first_name last_name career_length
0 altroni01 Nick Altrock 36
1 orourji01 Jim O'Rourke 33
2 minosmi01 Minnie Minoso 32
3 olearch01 Charley O'Leary 31
4 lathaar01 Arlie Latham 30
5 mcguide01 Deacon McGuire 29
6 eversjo01 Johnny Evers 28
7 jennihu01 Hughie Jennings 28
8 ryanno01 Nolan Ryan 28
9 streega01 Gabby Street 28
Show the code
avg_career_query = """
WITH CareerSpan AS (
    SELECT 
        playerID,
        MIN(yearID) AS start_year,
        MAX(yearID) AS end_year,
        (MAX(yearID) - MIN(yearID) + 1) AS career_length,
        SUM(G_all) AS total_games
    FROM 
        Appearances
    GROUP BY 
        playerID
    HAVING 
        total_games >= 10
)

SELECT 
    ROUND(AVG(career_length), 2) AS avg_career_length
FROM 
    CareerSpan;
"""

df_avg_career = pd.read_sql_query(avg_career_query, con)
df_avg_career
avg_career_length
0 6.84